10  Data Handling in R

Data Handling in R refers to the importing, managing, transforming, and cleaning of data before analysis. R provides powerful functions and packages such as readr, dplyr, tidyr, and data.table to efficiently handle large datasets.

10.0.1 Data file creation in R

create a data frame using the data.frame() function.

Code
create a data frame
# Create a simple data frame
students <- data.frame(
  Name = c("Arun", "Arun", "Charan", "Divya", "Eswar", 
           "Fathima", "Gopal", "Harini", "Ilango", "Jayanthi"),
  Age = c(25, 25, 35, 28, 22, 40, 33, 27, 31, 29),
  Height = c(5.6, 5.6, NA, 5.4, 6.0, 5.3, 5.9, 5.5, 5.7, 5.8),
  Gender = c("M", "M", NA, "F", "M", "F", "M", "F", "M", "F"),
  Marks = c(85, 85, 78, 88, 76, 95, 82, 89, 80, 91),
  Attendance = c(92, 92, 88, 90, 87, 95, 83, 89, 86, 91)  
)
head(students)
     Name Age Height Gender Marks Attendance
1    Arun  25    5.6      M    85         92
2    Arun  25    5.6      M    85         92
3  Charan  35     NA   <NA>    78         88
4   Divya  28    5.4      F    88         90
5   Eswar  22    6.0      M    76         87
6 Fathima  40    5.3      F    95         95

10.0.2 Importing Data into R

Importing data into R is a crucial step in data analysis. R offers various functions for loading data from different file types.

  • R supports importing data from various sources, including ** Excel, CSV, databases(SQL), JSON, and APIs**.

Import csv file from website

  • Paste the csv website link inside the read.csv() function.
  • Sample csv file website link
Code
# Read csv from website
Diabetes <- read.csv("https://raw.githubusercontent.com/yuvijen/Datasets/refs/heads/main/Diabetes.csv")

# Display the first few rows
head(Diabetes)
  Age Gender  BMI Family_History Physical_Activity      Diet_Type
1  48   Male 35.5             No              High Non-Vegetarian
2  18  Other 28.7            Yes            Medium Non-Vegetarian
3  21  Other 30.0            Yes              High Non-Vegetarian
4  25 Female 25.6             No            Medium     Vegetarian
5  78   Male 38.8             No              High Non-Vegetarian
6  60   Male 19.2             No              High     Vegetarian
  Smoking_Status Alcohol_Intake Stress_Level Hypertension Cholesterol_Level
1          Never           None       Medium          Yes             111.7
2        Current       Moderate         High           No             130.6
3        Current       Moderate         High          Yes             294.8
4         Former       Moderate         High          Yes             159.1
5        Current           High         High           No             215.0
6        Current           None          Low           No             160.2
  Fasting_Blood_Sugar Postprandial_Blood_Sugar HBA1C Heart_Rate Waist_Hip_Ratio
1               141.0                    165.6   8.9         94            0.91
2                83.1                    142.6   5.9         68            0.96
3               159.9                    212.4   4.8         70            0.88
4               133.3                    225.4  11.9         78            0.98
5               164.9                    218.1  11.6         65            0.85
6                77.8                    238.2   4.7         69            0.88
  Urban_Rural Health_Insurance Regular_Checkups
1       Urban               No               No
2       Rural              Yes              Yes
3       Rural               No               No
4       Rural               No               No
5       Urban               No               No
6       Urban               No              Yes
  Medication_For_Chronic_Conditions Pregnancies Polycystic_Ovary_Syndrome
1                                No           0                         0
2                                No           0                         0
3                               Yes           0                         0
4                               Yes           1                        No
5                               Yes           0                         0
6                               Yes           0                         0
  Glucose_Tolerance_Test_Result Vitamin_D_Level C_Protein_Level
1                         124.3            31.5            7.46
2                         151.4            12.5            5.64
3                         106.1            35.8            7.20
4                          85.6            15.4            6.53
5                          77.0            28.6            0.58
6                         180.2            49.0            1.83
  Thyroid_Condition Diabetes_Status
1               Yes             Yes
2               Yes              No
3                No             Yes
4               Yes              No
5                No             Yes
6                No              No

Import csv file from local folder

Sample csv data file - Download the csv dataset from this link.

copy the file path name or (if working directory is set) of the dataset and paste it inside the read.csv("file name")

  • To copy the path name of a file, right click on the file and select copy path name or use the keyboard shortcut:
  • for Windows- ctrl+shift+c for Mac- cmnd+opt+c
Code
# Read CSV file
Indiaagriculture = read.csv("Agri production india.csv")

# Display the first few rows
head(Indiaagriculture)
    Crop          State Cost.of.Cultivation....Hectare..A2.FL
1  ARHAR  Uttar Pradesh                               9794.05
2  ARHAR      Karnataka                              10593.15
3  ARHAR        Gujarat                              13468.82
4  ARHAR Andhra Pradesh                              17051.66
5  ARHAR    Maharashtra                              17130.55
6 COTTON    Maharashtra                              23711.44
  Cost.of.Cultivation....Hectare..C2 Cost.of.Production....Quintal..C2
1                           23076.74                           1941.55
2                           16528.68                           2172.46
3                           19551.90                           1898.30
4                           24171.65                           3670.54
5                           25270.26                           2775.80
6                           33116.82                           2539.47
  Yield..Quintal..Hectare..
1                      9.83
2                      7.47
3                      9.59
4                      6.42
5                      8.72
6                     12.69

Import excel file from local folder

readxl: To read Excel files in R, you need to install the readxl package.

  • The readxl package allows users to import Excel spreadsheets (.xls and .xlsx) into R for further analysis.
  • Supports both .xls (Excel 97-2003) and .xlsx (Excel 2007+) formats.
  • Reads specific sheets, ranges, and named regions.
Install readxl package
Code
Sample data file

Download the excel dataset from this link.

copy the file name (or file path name) of the dataset and paste it inside the function read_excel("file_name")

  • To copy the path name of a file, right click on the file and select copy path name or use the keyboard shortcut:
  • for Windows- ctrl+shift+c for Mac- cmnd+opt+c
Code
library(readxl)
Europeanagriculture = read_excel("Europeanagriculture.xlsx", sheet = 1)
head(Europeanagriculture)
# A tibble: 6 × 24
  Country  farms_number used_agricultural_area_ha standard_output_EUR
  <chr>           <dbl>                     <dbl> <chr>              
1 Belgium         36890                   1354250 8037986420         
2 Bulgaria       202720                   4468500 3842891030         
3 Czechia         26530                   3455410 NA                 
4 Denmark         35050                   2614600 10062442040        
5 Germany        276120                  16715320 49249020560        
6 Estonia         16700                    995100 801547060          
# ℹ 20 more variables: subsistence_semisubsistence_farms <chr>,
#   total_labour_persons <dbl>, total_labour_AWU <dbl>,
#   nonfamily_labour_persons <chr>, nonfamily_labour_AWU <chr>,
#   managers_basic_training <dbl>, managers_only_practical <dbl>,
#   managers_full_training <dbl>, managers_training_NA <chr>,
#   farms_SO_zero <chr>, farms_SO_less2000 <dbl>, `farms_SO_2000-3999` <dbl>,
#   `farms_SO_4000-7999` <dbl>, `farms_SO_8000-14999` <dbl>, …

Import Data from SQL database

  • First download the sql database file from this link
  • Save the sql file in your directory.
  • Install the required packages shown below.

install.packages("DBI") - Database interface package install.packages("RSQLite") - SQLite database driver (for MySQL, use RMariaDB) install.packages("dplyr") - For data manipulation

Code
Install the required packages for sql
Code
Importing Data from Databases (SQL)
# Load required libraries
library(DBI)
library(RSQLite)
library(dplyr)

# Connect to the database
con <- dbConnect(RSQLite::SQLite(), "my_database.sqlite")

# List tables in the database
dbListTables(con)
[1] "employees"
Code
# Import an entire table into R
empdata <- dbReadTable(con, "employees") 

# Display the first few rows
head(empdata)
  id    name age salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
Code
# Close the database connection
dbDisconnect(con)

10.0.3 Viewing and Exploring Data

Explore the Data

Checking Data Structure
  • View() - To view the entire table in a window
  • head() - View the first few rows
  • str() - Check structure of dataset
  • summary() - Summary statistics
Code
head(students)
     Name Age Height Gender Marks Attendance
1    Arun  25    5.6      M    85         92
2    Arun  25    5.6      M    85         92
3  Charan  35     NA   <NA>    78         88
4   Divya  28    5.4      F    88         90
5   Eswar  22    6.0      M    76         87
6 Fathima  40    5.3      F    95         95
Code
str(students)
'data.frame':   10 obs. of  6 variables:
 $ Name      : chr  "Arun" "Arun" "Charan" "Divya" ...
 $ Age       : num  25 25 35 28 22 40 33 27 31 29
 $ Height    : num  5.6 5.6 NA 5.4 6 5.3 5.9 5.5 5.7 5.8
 $ Gender    : chr  "M" "M" NA "F" ...
 $ Marks     : num  85 85 78 88 76 95 82 89 80 91
 $ Attendance: num  92 92 88 90 87 95 83 89 86 91
Code
summary(students)
     Name                Age           Height         Gender         
 Length:10          Min.   :22.0   Min.   :5.300   Length:10         
 Class :character   1st Qu.:25.5   1st Qu.:5.500   Class :character  
 Mode  :character   Median :28.5   Median :5.600   Mode  :character  
                    Mean   :29.5   Mean   :5.644                     
                    3rd Qu.:32.5   3rd Qu.:5.800                     
                    Max.   :40.0   Max.   :6.000                     
                                   NA's   :1                         
     Marks         Attendance   
 Min.   :76.00   Min.   :83.00  
 1st Qu.:80.50   1st Qu.:87.25  
 Median :85.00   Median :89.50  
 Mean   :84.90   Mean   :89.30  
 3rd Qu.:88.75   3rd Qu.:91.75  
 Max.   :95.00   Max.   :95.00  
                                
Checking Missing Values
Code
# Count missing values per column
colSums(is.na(students))
      Name        Age     Height     Gender      Marks Attendance 
         0          0          1          1          0          0 
Checking Duplicates
Code
# Identify duplicate rows
duplicates <- students[duplicated(students), ]
print(duplicates)
  Name Age Height Gender Marks Attendance
2 Arun  25    5.6      M    85         92

10.0.4 Handling missing values and imputations

Replace missing values with mean
Code
# Create a copy of the original data for data cleaning. 

clean <- students
#  Calculate the mean height (excluding NA)
mean_height <- mean(clean$Height, na.rm = TRUE)

# Replace NA values in Height with the calculated mean
clean$Height[is.na(clean$Height)] <- mean_height

print(clean)
       Name Age   Height Gender Marks Attendance
1      Arun  25 5.600000      M    85         92
2      Arun  25 5.600000      M    85         92
3    Charan  35 5.644444   <NA>    78         88
4     Divya  28 5.400000      F    88         90
5     Eswar  22 6.000000      M    76         87
6   Fathima  40 5.300000      F    95         95
7     Gopal  33 5.900000      M    82         83
8    Harini  27 5.500000      F    89         89
9    Ilango  31 5.700000      M    80         86
10 Jayanthi  29 5.800000      F    91         91
Remove missing values
Code
clean <- na.omit(clean)
print(clean)
       Name Age Height Gender Marks Attendance
1      Arun  25    5.6      M    85         92
2      Arun  25    5.6      M    85         92
4     Divya  28    5.4      F    88         90
5     Eswar  22    6.0      M    76         87
6   Fathima  40    5.3      F    95         95
7     Gopal  33    5.9      M    82         83
8    Harini  27    5.5      F    89         89
9    Ilango  31    5.7      M    80         86
10 Jayanthi  29    5.8      F    91         91

Removing Duplicates

Code
clean <- clean[!duplicated(clean), ]
print(clean)
       Name Age Height Gender Marks Attendance
1      Arun  25    5.6      M    85         92
4     Divya  28    5.4      F    88         90
5     Eswar  22    6.0      M    76         87
6   Fathima  40    5.3      F    95         95
7     Gopal  33    5.9      M    82         83
8    Harini  27    5.5      F    89         89
9    Ilango  31    5.7      M    80         86
10 Jayanthi  29    5.8      F    91         91

Changing Data Types

Code
# Convert column to numeric
clean$Height <- as.numeric(clean$Height)

# Convert column to character
clean$Name <- as.character(clean$Name)
str(clean)
'data.frame':   8 obs. of  6 variables:
 $ Name      : chr  "Arun" "Divya" "Eswar" "Fathima" ...
 $ Age       : num  25 28 22 40 33 27 31 29
 $ Height    : num  5.6 5.4 6 5.3 5.9 5.5 5.7 5.8
 $ Gender    : chr  "M" "F" "M" "F" ...
 $ Marks     : num  85 88 76 95 82 89 80 91
 $ Attendance: num  92 90 87 95 83 89 86 91
 - attr(*, "na.action")= 'omit' Named int 3
  ..- attr(*, "names")= chr "3"

10.0.5 Data Manipulation with dplyr

The dplyr package provides functions for filtering, selecting, modifying, and restructuring data.

Selecting Specific Columns

  • Install dplyr package
Code
Code
library(dplyr)

# Select specific columns
clean <- clean %>% select(Name,Age, Height, Gender, Marks)
print(clean)
       Name Age Height Gender Marks
1      Arun  25    5.6      M    85
4     Divya  28    5.4      F    88
5     Eswar  22    6.0      M    76
6   Fathima  40    5.3      F    95
7     Gopal  33    5.9      M    82
8    Harini  27    5.5      F    89
9    Ilango  31    5.7      M    80
10 Jayanthi  29    5.8      F    91

Removing Columns

Code
# Remove specific columns
clean <- clean %>% select(-Gender)
print(clean)
       Name Age Height Marks
1      Arun  25    5.6    85
4     Divya  28    5.4    88
5     Eswar  22    6.0    76
6   Fathima  40    5.3    95
7     Gopal  33    5.9    82
8    Harini  27    5.5    89
9    Ilango  31    5.7    80
10 Jayanthi  29    5.8    91

Filtering Data

Code
# Filter rows where column value is greater than 100
clean <- clean %>% filter(clean$Marks >= 80)
print(clean)
      Name Age Height Marks
1     Arun  25    5.6    85
2    Divya  28    5.4    88
3  Fathima  40    5.3    95
4    Gopal  33    5.9    82
5   Harini  27    5.5    89
6   Ilango  31    5.7    80
7 Jayanthi  29    5.8    91

Sorting Data

Code
# Arrange data in descending order
clean <- clean %>% arrange(desc(clean$Marks))
print(clean)
      Name Age Height Marks
1  Fathima  40    5.3    95
2 Jayanthi  29    5.8    91
3   Harini  27    5.5    89
4    Divya  28    5.4    88
5     Arun  25    5.6    85
6    Gopal  33    5.9    82
7   Ilango  31    5.7    80

10.0.6 Reshaping Data with tidyr

Converting Wide Data to Long Format

Code
library(tidyr)

clean <- clean %>% pivot_longer(cols = c("Age", "Height", "Marks"), names_to = "Details", values_to = "Values")
print(clean)
# A tibble: 21 × 3
   Name     Details Values
   <chr>    <chr>    <dbl>
 1 Fathima  Age       40  
 2 Fathima  Height     5.3
 3 Fathima  Marks     95  
 4 Jayanthi Age       29  
 5 Jayanthi Height     5.8
 6 Jayanthi Marks     91  
 7 Harini   Age       27  
 8 Harini   Height     5.5
 9 Harini   Marks     89  
10 Divya    Age       28  
# ℹ 11 more rows

Converting Long Data to Wide Format**

Code
clean <- clean %>% pivot_wider(names_from = "Details", values_from = "Values")
print(clean)
# A tibble: 7 × 4
  Name       Age Height Marks
  <chr>    <dbl>  <dbl> <dbl>
1 Fathima     40    5.3    95
2 Jayanthi    29    5.8    91
3 Harini      27    5.5    89
4 Divya       28    5.4    88
5 Arun        25    5.6    85
6 Gopal       33    5.9    82
7 Ilango      31    5.7    80

10.0.7 Exporting Data from R

Export the data file as a csv file to local folder

The write.csv() function in R is used to export a dataset into a Comma-Separated Values (CSV) file, making it easy to save, share, and analyze data in other tools like Excel, Python, or SQL.

write.csv(data, "folder path/filename.csv") To save the CSV in a specific folder, provide the path.

Code
export the data frame as csv to local
# Export the data frame as csv
write.csv(clean, "cleanstudentsdata.csv")

Export data file as a excel file to local folder

writexl: Writing Data to Excel Files

The writexl package provides an easy way to export data from R into an Excel file without requiring external dependencies.

Key Features:

  • Writes .xlsx files quickly.
  • Preserves column types and formats.
Install writexl package
Code
install writexl package
install.packages("writexl")
  • export the clean data file as xlsx file.
Code
library(writexl)
# Write data to an Excel file
write_xlsx(clean, "cleanstudentsdata.xlsx")

To check the list of files present in the directory (or project)

Code
check the list of files in the directory